Lab 04a: One hot encoding

Introduction

This lab demonstrates how to apply one hot encoding to categorical variables with pandas. At the end of the lab, you should be able to use pandas to:

  • Encode categorical variables via one hot encoding.
  • Modify a data frame to substitute the newly encoded variables for the categorical labels they were generated from.

Getting started

Let's start by importing pandas in the usual way.


In [ ]:
import pandas as pd

Next, let's load the data. Write the path to your iris.csv file in the cell below:


In [ ]:
path_to_csv = "data/iris.csv"

Execute the cell below to load the data into a pandas data frame and index that data frame by the sample_number column:


In [ ]:
df = pd.read_csv(path_to_csv, index_col=['sample_number'])

Take a quick peek at the data:


In [ ]:
df.head()

One hot encoding

We can examine the type of the data in our data frame via the dtypes attribute, as follows:


In [ ]:
df.dtypes

As you can see, we have four columns of numerical data (float64), corresponding to the physical measurements, and one column of text data (object), corresponding to the species labels. Let's take a closer look at the unique values in the species column:


In [ ]:
df['species'].unique()

If we wanted to use these labels as input to a machine learning algorithm, we would first need to convert them from text into a numerical format, because most algorithms operate only on numbers. One way to do this would be to assign a numerical value to each species, e.g. setosa = 0, versicolor = 1, virginica = 2. However, this imposes an artificial ordering on the labels: setosa is not "less than" versicolor or virginica in any mathematical sense, yet an algorithm could treat it as though it were.
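To see why integer codes can mislead, here is a minimal sketch on a handful of made-up labels (the toy series and the mapping are illustrative assumptions, not part of the iris data file):

```python
import pandas as pd

# Hypothetical toy labels standing in for the species column
labels = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Integer encoding: each species is mapped to an arbitrary number
codes = labels.map({"setosa": 0, "versicolor": 1, "virginica": 2})

# Arithmetic on the codes is possible but meaningless: the "average" of
# setosa (0) and virginica (2) comes out as the code for versicolor (1)
mean_code = (codes.iloc[0] + codes.iloc[2]) / 2

print(codes.tolist())
print(mean_code)
```

Any algorithm that computes distances or averages over these codes would pick up relationships between the species that do not exist in the data.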

A better alternative is to create a set of new features, one per label, that encode the labels in such a way that an algorithm views them as equal. This technique is known as one hot encoding, and pandas supports it via the get_dummies function:


In [ ]:
encoded_features = pd.get_dummies(df['species'])

encoded_features.head()  # Take a quick look at the result

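As you can see, pandas has encoded each label as a binary indicator variable, where a "1" represents the presence of the label and a "0" represents its absence. (In pandas 2.0 and later, the indicators are booleans by default, so you will see True and False instead; passing dtype=int to get_dummies gives 1s and 0s.)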
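If you'd like to check the encoding without the iris file, the following sketch applies get_dummies to a small made-up series (the toy labels are an assumption for illustration). It passes dtype=int so that the indicators appear as 1s and 0s regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy labels standing in for the species column
toy = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# One indicator column is created per unique label
encoded = pd.get_dummies(toy, dtype=int)

# Exactly one indicator is "hot" in every row
hot_per_row = encoded.sum(axis="columns")

print(encoded)
print(hot_per_row.tolist())
```

Each row contains a single 1, in the column matching that row's original label, which is where the name "one hot" comes from.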

We can use the pandas concat function to glue the new features to our existing data frame, column by column:


In [ ]:
df = pd.concat([df, encoded_features], axis='columns')

df.head()
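One detail worth knowing is that concat matches rows by their index labels rather than by their position. In the lab this is harmless, because the encoded features inherit the index of the original data frame. The sketch below uses two made-up frames with deliberately mismatched indexes (all names and values are illustrative assumptions) to show what happens otherwise:

```python
import pandas as pd

# Hypothetical frames: the index labels differ in order and in coverage
left = pd.DataFrame({"sepal_length": [5.1, 7.0, 6.3]}, index=[1, 2, 3])
right = pd.DataFrame({"setosa": [0, 1]}, index=[3, 1])

# Rows are aligned on index labels; labels missing from one frame get NaN
joined = pd.concat([left, right], axis="columns")

print(joined)
```

Here the row labelled 1 receives the value 1 from the second frame even though that value is stored second, and the row labelled 2 receives NaN because the second frame has no such label.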

Finally, we can use the drop method to remove the original species column from the data frame, leaving us with the numerical measurements and the new encoded features:


In [ ]:
df = df.drop('species', axis='columns')

df.head()
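As an aside, the encode, concat and drop steps can also be collapsed into a single call by passing the whole data frame to get_dummies and naming the column to encode. This is a sketch on a made-up frame (the sepal_length column name and its values are assumptions), not a replacement for the steps above:

```python
import pandas as pd

# Hypothetical toy frame with one measurement column and one label column
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# Encodes the species column and drops the original in one step
one_step = pd.get_dummies(toy_df, columns=["species"], dtype=int)

print(one_step.columns.tolist())
```

Note that in this form pandas prefixes each new column with the name of the column it came from (e.g. species_setosa), whereas the step-by-step approach produces unprefixed column names.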